L0 Regularization
Author
Abstract
We propose a practical method for L0 norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. AIC and BIC, well-known model selection criteria, are special cases of L0 regularization. However, since the L0 norm of weights is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, for certain distributions over the gates, the expected L0 regularized objective is differentiable with respect to the distribution parameters. We further propose the hard concrete distribution for the gates, which is obtained by “stretching” a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result, our method allows for straightforward and efficient learning of model structures with stochastic gradient descent and enables conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.
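The construction described above can be sketched directly in code. The following is a minimal, illustrative PyTorch sketch (not the authors' released implementation): a uniform sample is pushed through the binary concrete reparameterization, stretched to an interval (gamma, zeta) slightly wider than [0, 1], and clipped with a hard-sigmoid so that exact zeros occur with non-zero probability, while the expected L0 penalty stays differentiable in the gate parameters. The class name HardConcreteGate, the parameter names log_alpha, beta, gamma and zeta, and the constants below are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """One stochastic gate z in [0, 1] per weight (or per group of weights)."""

    def __init__(self, shape, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(shape))     # location parameter of the gate
        self.beta, self.gamma, self.zeta = beta, gamma, zeta  # temperature and stretch limits

    def forward(self):
        if self.training:
            # Reparameterized sample from the binary concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # "Stretch" to (gamma, zeta) and apply a hard-sigmoid, so exact zeros can occur.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # Probability that each gate is non-zero; the sum is the expected L0 penalty,
        # which is differentiable with respect to log_alpha.
        return torch.sigmoid(self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)).sum()


# Usage sketch: gates multiply the weights, and the expected L0 term joins the loss,
# so gate parameters and network weights are optimized jointly by SGD.
gate = HardConcreteGate(shape=(256, 128))
weight = nn.Parameter(torch.randn(256, 128) * 0.01)
lam = 1e-4                       # regularization strength (illustrative value)
x = torch.randn(32, 128)
out = x @ (weight * gate()).t()  # gated linear layer
loss = out.pow(2).mean() + lam * gate.expected_l0()  # placeholder task loss plus L0 penalty
loss.backward()
```

In a full network one gate would typically be attached per weight, per row, or per feature map; the snippet only shows the shapes lining up for a single weight matrix and a placeholder task loss.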
Similar references
Theories on Group Variable Selection in Multivariate Regression Models (The Florida State University, College of Arts and Sciences)
We study group variable selection in multivariate regression models. Group variable selection means selecting the non-zero rows of the coefficient matrix: since there are multiple response variables, if a predictor is irrelevant to the estimation then the corresponding row must be zero. In a high-dimensional setup, shrinkage estimation methods are applicable and guarantee smaller MSE than OLS acc...
A risk ratio comparison of L0 and L1 penalized regression
In the past decade, there has been an explosion of interest in using l1-regularization in place of l0-regularization for feature selection. We present theoretical results showing that while l1-penalized linear regression never outperforms l0-regularization by more than a constant factor, in some cases using an l1 penalty is infinitely worse than using an l0 penalty. We also compare algorithms f...
The l0-norm-based Blind Image Deconvolution: Comparison and Inspiration
Single image blind deblurring has been intensively studied since Fergus et al.'s variational Bayes method in 2006. It is now commonly believed that blur-kernel estimation accuracy depends heavily on the salient edge information pursued from the blurred image, which has stimulated numerous l0-approximating blind deblurring methods built on various techniques and tricks. This paper, however, focuse...
Quantitative susceptibility imaging with homotopic L0 minimization programming: preliminary study of brain
The susceptibility-weighted imaging (SWI) technique is used for neuroimaging to improve visibility of iron deposits, veins, and hemorrhage [1]. Quantitative susceptibility imaging (QSI) improves upon SWI by measuring iron in tissues, which can be useful for molecular/cellular imaging to analyze brain function, diagnose neurological diseases, and quantify contrast agent concentrations. ...
Efficient Regularized Regression for Variable Selection with L0 Penalty
Variable (feature, gene, model, terms we use interchangeably) selection for regression with high-dimensional big data has found many applications in bioinformatics, computational biology, image processing, and engineering. One appealing approach is L0 regularized regression, which directly penalizes the number of nonzero features in the model. L0 is known as the most essential sparsity meas...
Learning Sparse Neural Networks through L0 Regularization